Load the Data

Prosper is an online platform where people can request to borrow money and investors can fund these requests. Prosper manages the marketplace of both borrowers and investors, determines credit guidelines, and keeps a percentage of the revenue. The data set contains 113,937 loans with 81 variables.

There are many data points to analyze however I’m interested in causes for delinquent and or charged off accounts? Are there relationships between variables to explain these derogatory accounts? If so, what are they and can they help mitigate future losses with more reliable borrowers?

For those who are curious to know what specific variables we can work with, I’ve taken the liberty to printing them out for your viewing pleasure alphabetically in table() format.

## 
##                    AmountDelinquent             AvailableBankcardCredit 
##                                   1                                   1 
##                 BankcardUtilization                         BorrowerAPR 
##                                   1                                   1 
##                        BorrowerRate                       BorrowerState 
##                                   1                                   1 
##                          ClosedDate                         CreditGrade 
##                                   1                                   1 
##               CreditScoreRangeLower               CreditScoreRangeUpper 
##                                   1                                   1 
##                  CurrentCreditLines                CurrentDelinquencies 
##                                   1                                   1 
##                    CurrentlyInGroup                    DateCreditPulled 
##                                   1                                   1 
##                   DebtToIncomeRatio             DelinquenciesLast7Years 
##                                   1                                   1 
##                    EmploymentStatus            EmploymentStatusDuration 
##                                   1                                   1 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                                   1                                   1 
##                     EstimatedReturn             FirstRecordedCreditLine 
##                                   1                                   1 
##                            GroupKey                         IncomeRange 
##                                   1                                   1 
##                    IncomeVerifiable                InquiriesLast6Months 
##                                   1                                   1 
##         InvestmentFromFriendsAmount          InvestmentFromFriendsCount 
##                                   1                                   1 
##                           Investors                 IsBorrowerHomeowner 
##                                   1                                   1 
##                         LenderYield           ListingCategory..numeric. 
##                                   1                                   1 
##                 ListingCreationDate                          ListingKey 
##                                   1                                   1 
##                       ListingNumber           LoanCurrentDaysDelinquent 
##                                   1                                   1 
##       LoanFirstDefaultedCycleNumber                             LoanKey 
##                                   1                                   1 
##          LoanMonthsSinceOrigination                          LoanNumber 
##                                   1                                   1 
##                  LoanOriginalAmount                 LoanOriginationDate 
##                                   1                                   1 
##              LoanOriginationQuarter                          LoanStatus 
##                                   1                                   1 
##                   LP_CollectionFees                 LP_CustomerPayments 
##                                   1                                   1 
##        LP_CustomerPrincipalPayments               LP_GrossPrincipalLoss 
##                                   1                                   1 
##                  LP_InterestandFees                 LP_NetPrincipalLoss 
##                                   1                                   1 
##     LP_NonPrincipalRecoverypayments                      LP_ServiceFees 
##                                   1                                   1 
##                           MemberKey                  MonthlyLoanPayment 
##                                   1                                   1 
##                          Occupation               OnTimeProsperPayments 
##                                   1                                   1 
##                     OpenCreditLines               OpenRevolvingAccounts 
##                                   1                                   1 
##         OpenRevolvingMonthlyPayment                       PercentFunded 
##                                   1                                   1 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                                   1                                   1 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                                   1                                   1 
##               ProsperRating..Alpha.             ProsperRating..numeric. 
##                                   1                                   1 
##                        ProsperScore            PublicRecordsLast10Years 
##                                   1                                   1 
##           PublicRecordsLast12Months                     Recommendations 
##                                   1                                   1 
##              RevolvingCreditBalance         ScorexChangeAtTimeOfListing 
##                                   1                                   1 
##                 StatedMonthlyIncome                                Term 
##                                   1                                   1 
##          TotalCreditLinespast7years                      TotalInquiries 
##                                   1                                   1 
##                   TotalProsperLoans          TotalProsperPaymentsBilled 
##                                   1                                   1 
##                         TotalTrades  TradesNeverDelinquent..percentage. 
##                                   1                                   1 
##             TradesOpenedLast6Months 
##                                   1

Univariate Plot Section

After quickly playing with the data set I realized running summary(), na.omit, sd(), var(), range(), and quantile() becomes real tedious so I created a function, sumFun(), which executes everything just mentioned in a few words of code. If you care to see the construction of the function please reference the data Set.Rmd file.

Bar Plot of Loan Status

Lets look at a at bar plot of Loan Status to see if the visual gives us anything we can sink out teeth into?

The plot does show us completed and current loans are more frequent than derogatory items. What’s interesting is charge offs are more frequent than defaults. I had imagined far greater defaults and less charge offs through loan work out’s, lengthier terms, or forbearance agreements. Good plot to get us started.

Borrower APR

Let’s begin with running our sumFun() function to better understand interest rates charged to borrowers.

Borrower APR Summary via sumFun() function.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 
## [1] "Standard Deviation 0.0804"
## [1] "Variance 0.0065"
## [1] "Range 0.0065" "Range 0.5123"
## [1] "Quantile:"
##      0%     25%     50%     75%    100% 
## 0.00653 0.15629 0.20976 0.28381 0.51229

Okay, so we have a wide spectrum of interest rate charges. Lets place the interest rates in a table and break them up by 0.5 % increments.

## apr_cut
## (0.0065,0.0115] (0.0115,0.0165] (0.0165,0.0215] (0.0215,0.0265] 
##               3               7               8               2 
## (0.0265,0.0315] (0.0315,0.0365] (0.0365,0.0415] (0.0415,0.0465] 
##               6               1               5               2 
## (0.0465,0.0515] (0.0515,0.0565] (0.0565,0.0615] (0.0615,0.0665] 
##               9              11              77             272 
## (0.0665,0.0715] (0.0715,0.0765] (0.0765,0.0815] (0.0815,0.0865] 
##             627             457             724            1131 
## (0.0865,0.0915] (0.0915,0.0965]  (0.0965,0.102]   (0.102,0.107] 
##            1474            1469            1324             641 
##   (0.107,0.112]   (0.112,0.117]   (0.117,0.122]   (0.122,0.127] 
##             890            1439            1573            2603 
##   (0.127,0.132]   (0.132,0.137]   (0.137,0.142]   (0.142,0.147] 
##            2191            2067            2238            2221 
##   (0.147,0.151]   (0.151,0.157]   (0.157,0.162]   (0.162,0.167] 
##            2567            2584            2735            2283 
##   (0.167,0.172]   (0.172,0.177]   (0.177,0.182]   (0.182,0.186] 
##            2538            3292            2501            2797 
##   (0.186,0.192]   (0.192,0.197]   (0.197,0.202]   (0.202,0.207] 
##            2485            2541            2397            3250 
##   (0.207,0.212]   (0.212,0.216]   (0.216,0.222]   (0.222,0.227] 
##            2208            2866            2090            2086 
##   (0.227,0.232]   (0.232,0.237]   (0.237,0.242]   (0.242,0.246] 
##            2357            2203            1772            2064 
##   (0.246,0.252]   (0.252,0.257]   (0.257,0.262]   (0.262,0.267] 
##            2351            1897            1567            1117 
##   (0.267,0.272]   (0.272,0.277]   (0.277,0.282]   (0.282,0.287] 
##            1437            2511             897            1402 
##   (0.287,0.292]   (0.292,0.296]   (0.296,0.301]   (0.301,0.306] 
##            2001            3014            1452            2251 
##   (0.306,0.311]   (0.311,0.317]   (0.317,0.322]   (0.322,0.327] 
##            1458            1024             801            1307 
##   (0.327,0.332]   (0.332,0.337]   (0.337,0.342]   (0.342,0.347] 
##             667            1177             767             930 
##   (0.347,0.352]   (0.352,0.357]   (0.357,0.361]   (0.361,0.366] 
##             842            2920            4390             298 
##   (0.366,0.371]   (0.371,0.377]   (0.377,0.382]   (0.382,0.387] 
##             143            1396             279              83 
##   (0.387,0.392]   (0.392,0.397]   (0.397,0.402]   (0.402,0.407] 
##             214             117              14              21 
##   (0.407,0.412]   (0.412,0.417]   (0.417,0.422]   (0.422,0.426] 
##              17              55               0               1 
##   (0.426,0.431]   (0.431,0.436]   (0.436,0.442]   (0.442,0.447] 
##               0               0               0               0 
##   (0.447,0.452]   (0.452,0.457]   (0.457,0.462]   (0.462,0.467] 
##               0               0               2               1 
##   (0.467,0.472]   (0.472,0.477]   (0.477,0.482]   (0.482,0.486] 
##               0               0               0               0 
##   (0.486,0.491]   (0.491,0.496]   (0.496,0.501]   (0.501,0.506] 
##               1               0               0               1 
##   (0.506,0.511]   (0.511,0.516] 
##               0               1

The table helps but I want a visual to help me understand the distribution with a simple plot() function.

Much better! We can see there is a normal distribution where majority of interest rates are between 14% and 26%. We can double check by running quantile() on its own.

##      0%     25%     50%     75%    100% 
## 0.00653 0.15629 0.20976 0.28381 0.51229

Lower Credit Score Range Histogram

Lets take a quick look at the distribution of the lower bound credit score ranges using a histogram plot.

We see that most of the borrowers have credit scores above 650 and we do have a number of borrowers below 650 as well.

Upper Credit Score Range Histogram

The distribution is different than for lower bound credit scores but one thing is for sure, prosper.com likes higher credit score borrower.

Borrower Rate

The difference between rate and APR is always part of due diligence for borrowers deciding on financing options. Even though in some situations APR makes decisions extremely confusing. Its always a good idea to understand if the APR amortizes the cost associated with the loan throughout the entire term or just the first year? Simply put, fees for originating loans are only collected once, at time of origination and they never reoccur. In either event its a good idea for us to understand how the consumer see’s their cost. Lets look at the actual rate borrowers are quoted.

Lets begin with a summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975 
## [1] "Standard Deviation 0.0748"
## [1] "Variance 0.0056"
## [1] "Range 0"      "Range 0.4975"
## [1] "Quantile:"
##     0%    25%    50%    75%   100% 
## 0.0000 0.1340 0.1840 0.2500 0.4975

No surprise here, the mean borrower rate is .1928 or 19.28% and the mean borrower APR is 21.88%, a mean difference of 2.6%. Now lets look at a more colorful histogram plot of borrower rate.

Lender Yield

What good is interest if there’s no yield. Assuming Prosper is always on-top of usury laws lets understand the amount of return or yields investors/ lenders can anticipate? Can this variable be used later in our analysis to paint a picture for delinquent or charge off accounts?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925 
## [1] "Standard Deviation 0.0745"
## [1] "Variance 0.0056"
## [1] "Range -0.01"  "Range 0.4925"
## [1] "Quantile:"
##      0%     25%     50%     75%    100% 
## -0.0100  0.1242  0.1730  0.2400  0.4925

It’s obvious lender yields will correlate with borrower APR and rate. And the summary does not really tell us any more than we already knew, so for now we’ll leave this variable on the back burner. Lets still plot it for us to have a visual.

Estimated Loss

We all know derogatory accounts will show up as a loss somewhere or another so lets run a summary on estimated loss to see what’s in store for us?

Estimated Loss Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00490 0.04240 0.07240 0.08031 0.11200 0.36600 
## [1] "Standard Deviation 0.0468"
## [1] "Variance 0.0022"
## [1] "Range 0.0049" "Range 0.366" 
## [1] "Quantile:"
##     0%    25%    50%    75%   100% 
## 0.0049 0.0424 0.0724 0.1120 0.3660

The problem here, and I should have caught this earlier is that we are not sure how these losses are calculated, in other words the mean estimated loss is 0.08031 but we’re not sure if this is from total revenue, from lender yield, from gross, or what? This variable may come in handy later on but for now its simply to vague.

Estimated Loss Histogram Plot

Loan Origination Amount

The original loan amount will surely help us navigate our data set. We’ve all been there before, borrowing more than we’re comfortable with. Will loan amounts have an impact on understanding delinquencies? Lets begin with running a summary be ready to combine this variable with others in our exploration to see what role loan amount play?

Loan Origination Amount Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000 
## [1] "Standard Deviation 6245.8006"
## [1] "Variance 39010024.8873"
## [1] "Range 1000"  "Range 35000"
## [1] "Quantile:"
##    0%   25%   50%   75%  100% 
##  1000  4000  6500 12000 35000

The loan amounts are reasonable amounts for borrowers considering debt consolidation. When reviewing Prosper.com this is a main reason for borrowers seeking funding.

Loan Original Amount Histogram Plot

Its not the best looking distribution but it gives us an idea of the direction of the tail.

Debt to Income Ratio

Debt to income ratio is a key variable when creditors and lenders decide on an applicants request for funding. Lets take a look at our summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1400  0.2200  0.2759  0.3200 10.0100 
## [1] "Standard Deviation 0.5518"
## [1] "Variance 0.3044"
## [1] "Range 0"     "Range 10.01"
## [1] "Quantile:"
##    0%   25%   50%   75%  100% 
##  0.00  0.14  0.22  0.32 10.01

We see we have outliers of 10.0100, there must be some sort of glitch, if we move our decimal to the right two places these borrowers have 1,000% debt to income ratio, very unlikely. But why is it happening?

Debt to Income Ratio Histogram Plot

When we cut away the outliers the graph look reasonable. More money loans originated to borrowers with lower debt to income ratio’s that higher ratios.

Univariate Analysis

What are the structure of your data set?

There are 113.9 thousand row’s of data and 81 features (variables).

Preliminary observations:

  • 21.88% - mean borrower APR
  • 19.28% - mean borrower rate
  • 18.27% - mean lender yield
  • 08.03% - mean estimated loss
  • $8,337 - mean original loan amount

What are the main features of interest in your data set?

The data set is fascinating and we can subset() and or select() so many variables and dive deeper into our data set. For now I’ll lets focus on loan status in particular derogatory accounts.

What other features in the data set do you think will help support your

investigation into your features of interest?

  • Loan Status
  • Occupation
  • Revolving Credit Balance
  • Bank Card Utilization
  • Income Verifiable
  • Credit Score Range (Lower, Upper)

Did you create any new variables from existing variables in the data set?

Yes, and to be sure we’re on the same page, below we have a new variable as the frequency distribution for the LoanStatus variable. My goal here is to better understand the frequency of derogatory items such as charge offs and defaulted.

##                        loanStatus.freq
## Cancelled                            5
## Chargedoff                       11992
## Completed                        38074
## Current                          56576
## Defaulted                         5018
## FinalPaymentInProgress             205
## Past Due (>120 days)                16
## Past Due (1-15 days)               806
## Past Due (16-30 days)              265
## Past Due (31-60 days)              363
## Past Due (61-90 days)              313
## Past Due (91-120 days)             304

Of the features investigated, were there any unusual distributions?

Did you perform any operations? If so, why?

The debt to income variable is interesting, there are over 200 debt to income ratio of 10. In other words the borrowers debt is 10 times greater than their income. This is unusually high. The operation I performed was to better understand the frequency distribution.

Bivariate Plots Section

Loan Status vs Borrower APR

Great, now lets look at two variables at the same time. We’ll begin with loan status and borrower APR. As we know from above loan status a categorical variable and will play nicely with borrower APR when we plot. Lets also print outliers for us to review.

It’s a difficult plot to make out. Lets reverses our axes for a better looking plot.

## notch went outside hinges. Try setting notch=FALSE.

Much better. We see the different loan status and outliers but still not a clear picture.

Credit Grade vs Borrower APR

Lets see if credit grade has any affect on Borrower APR? We’ll print the outliers as well.

Credit grade ‘A’ is peculiar. After carefully reviewing the graph is seems that AA is a higher credit grade than A however due to logic AA is alphabetically graphed after A. In other-words they are switched around and if you consider the switch it seems lower credit grades have higher interest rates.

Total Credit Lines Past 7 Years vs Loan Status

Lets take a closer look at trade lines and loan status. Can having more trade lines affect delinquency?

There might be something here. It seems that the greater the amount of credit trade lines the less borrowers are likely to default, once borrowers get above 75 trade lines. This is good progress towards our analysis.

Total Trades vs Loan Status

## notch went outside hinges. Try setting notch=FALSE.

The graph above is even more promising for borrowers who historically have large number of trade lines without limiting for number of years as the previous graph showed.

Revolving Credit Balance vs Loan Status

Lets see if revolving credit card balances plays a role in defaults?

Borrowers with greater revolving balances tend to keep their good standings.
We’ll keep these variables into consideration as we move along.

Bivariate Analysis

Talk about some of the relationships. How did the features of interest

vary with other features in the dataset?

My goal is to better understand what variables or patterns play a role in understanding borrower defaults. I began with plotting ‘LoanStatus’ with ‘BorrowerAPR’ to see if we can determine if certain groups of interest rates have greater defaults when compared to others. On the second plot I kept the variable ‘BorrowerAPR’ and but added ‘CreditGrade’ to see if there is some sort of logic behind credit rating and APR. Then I plotted ‘LoanStatus’ with ‘TotalCreditLinesPast7years’ and ‘TotalTrades’. The difference caps trade-lines at 7 years and the later is a historical variable without limiting for specific number of years. The last plot keeps ‘LoanStatus’ and plots ‘RevolvingCreditBalance’.

These features are different than others in the data set because they are more granular in personal detail that make up a borrowers credit history which is a direct reflection on interest rates.

Did you observe any interesting relatinships between the other features?

The first couple of plots did not give us a whole lot to be excited about.
However as we plotted ‘LoanStatus’ against ‘TotalCreditLines’ there are reasons to believe trade lines do play a role in default rates. The greater the number of trade-lines the least likely a borrower defaults. This is especially the case with borrower with 75 or greater historical trade-lines.

What was the strongest relationship you found?

The strongest relationship I found was between ‘TotalTrades’ and ‘LoanStatus’.
It looks as if historically borrowers with greater than ~60 or so trade-lines tend not to default when compared with borrowers with fewer trade lines.

Multivariate Analysis

Lets clear up some of the clutter and filter() ‘Defaulted’ | ‘Chargedoff’ from our LoanStatus variable.

##                ListingKey ListingNumber           ListingCreationDate
## 1 0F483544120452347F48121        577164 2012-04-10 09:14:46.297000000
## 2 0FDF336575182398852A8DC         31540 2006-08-15 12:21:09.433000000
## 3 101B35764008108187B2EBA        758964 2013-04-22 13:29:19.073000000
## 4 10AD33803102786533722BB         93263 2007-02-04 17:24:27.850000000
## 5 0FEA35330893790885F63E8        540655 2011-11-22 14:13:53.397000000
## 6 0F183387070536545CB7D86        120164 2007-04-06 23:07:37.940000000
##   CreditGrade Term LoanStatus          ClosedDate BorrowerAPR BorrowerRate
## 1               36  Defaulted 2012-12-19 00:00:00     0.35797       0.3177
## 2          AA   36 Chargedoff 2008-05-22 00:00:00     0.13202       0.1250
## 3               36  Defaulted 2013-12-26 00:00:00     0.28032       0.2419
## 4          HR   36 Chargedoff 2008-08-08 00:00:00     0.30300       0.2900
## 5               36 Chargedoff 2013-05-22 00:00:00     0.29394       0.2552
## 6           D   36 Chargedoff 2009-11-24 00:00:00     0.23748       0.2300
##   LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 1      0.3077                  0.2896        0.1650          0.1246
## 2      0.1175                      NA            NA              NA
## 3      0.2319                  0.2126        0.1075          0.1051
## 4      0.2850                      NA            NA              NA
## 5      0.2452                  0.2392        0.1190          0.1202
## 6      0.2100                      NA            NA              NA
##   ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 1                       1                    HR            5
## 2                      NA                                 NA
## 3                       3                     D            5
## 4                      NA                                 NA
## 5                       3                     D            5
## 6                      NA                                 NA
##   ListingCategory..numeric. BorrowerState    Occupation EmploymentStatus
## 1                        13            FL         Other            Other
## 2                         0                Professional    Not available
## 3                        15            IL Skilled Labor         Employed
## 4                         0                       Other    Not available
## 5                         1            NY  Construction         Employed
## 6                         0            CA         Other        Full-time
##   EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 1                      121                True            False
## 2                       NA                True             True
## 3                        3               False            False
## 4                       NA               False            False
## 5                      155                True            False
## 6                       16               False             True
##                  GroupKey              DateCreditPulled
## 1                                   2012-04-10 09:14:41
## 2 EF543365873906104DC9A04 2006-08-15 10:56:13.020000000
## 3                                   2013-04-22 13:29:14
## 4                         2007-02-03 20:57:08.027000000
## 5                                   2011-12-31 13:09:59
## 6 B5F93364617561572C32B6D 2007-03-18 21:48:08.060000000
##   CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## 1                   700                   719     1999-03-08 00:00:00
## 2                   760                   779     1990-05-19 00:00:00
## 3                   680                   699     1998-06-09 00:00:00
## 4                   520                   539     1997-10-24 00:00:00
## 5                   720                   739     1992-04-20 00:00:00
## 6                   600                   619     1998-12-04 00:00:00
##   CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## 1                 10               9                         18
## 2                 NA              NA                         36
## 3                  7               7                         34
## 4                 NA              NA                         20
## 5                 18              14                         37
## 6                 12              10                         24
##   OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## 1                     5                         300                    0
## 2                    16                         346                    7
## 3                     5                         245                    0
## 4                     0                           0                    1
## 5                    14                         331                    7
## 6                    10                        1402                    3
##   TotalInquiries CurrentDelinquencies AmountDelinquent
## 1              1                    0                0
## 2             17                    2               NA
## 3              0                    0                0
## 4              9                   14               NA
## 5             11                    0                0
## 6              6                    0                0
##   DelinquenciesLast7Years PublicRecordsLast10Years
## 1                       0                        1
## 2                       0                        0
## 3                       0                        0
## 4                       3                        0
## 5                       0                        0
## 6                       0                        0
##   PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## 1                         0                   9103                0.97
## 2                        NA                     NA                  NA
## 3                         0                   7975                0.84
## 4                        NA                     NA                  NA
## 5                         0                   8477                0.59
## 6                         0                  47816                0.89
##   AvailableBankcardCredit TotalTrades TradesNeverDelinquent..percentage.
## 1                     178          17                               0.94
## 2                      NA          NA                                 NA
## 3                     607          22                               1.00
## 4                      NA          NA                                 NA
## 5                    4444          27                               0.96
## 6                    5874          23                               1.00
##   TradesOpenedLast6Months DebtToIncomeRatio    IncomeRange
## 1                       0              0.49 $50,000-74,999
## 2                      NA              0.12  Not displayed
## 3                       0              0.39 $25,000-49,999
## 4                      NA              0.27  Not displayed
## 5                       0              0.15      $100,000+
## 6                       0              0.59 $50,000-74,999
##   IncomeVerifiable StatedMonthlyIncome                 LoanKey
## 1             True           5500.0000 3EE2364952142596779635D
## 2             True           5833.3333 51453366538336630763636
## 3             True           2500.0000 D0623679715048926AB9F4D
## 4             True            416.6667 2D433381278297817311230
## 5             True           9166.6667 3F2936403268585514E9BDD
## 6             True           3891.6667 0378338759198235096D61F
##   TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## 1                NA                         NA                    NA
## 2                NA                         NA                    NA
## 3                NA                         NA                    NA
## 4                NA                         NA                    NA
## 5                NA                         NA                    NA
## 6                NA                         NA                    NA
##   ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 1                                  NA                              NA
## 2                                  NA                              NA
## 3                                  NA                              NA
## 4                                  NA                              NA
## 5                                  NA                              NA
## 6                                  NA                              NA
##   ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 1                       NA                          NA
## 2                       NA                          NA
## 3                       NA                          NA
## 4                       NA                          NA
## 5                       NA                          NA
## 6                       NA                          NA
##   ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 1                          NA                       140
## 2                          NA                      2239
## 3                          NA                       193
## 4                          NA                      2161
## 5                          NA                       414
## 6                          NA                      1688
##   LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## 1                             8                         23      63982
## 2                             1                         91       2711
## 3                             8                         11      89204
## 4                            18                         85       7286
## 5                            17                         26      59180
## 6                            31                         83      13475
##   LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 1               4000 2012-04-19 00:00:00                Q2 2012
## 2              10000 2006-08-22 00:00:00                Q3 2006
## 3               2000 2013-04-26 00:00:00                Q2 2013
## 4               1500 2007-02-09 00:00:00                Q1 2007
## 5               2000 2012-01-20 00:00:00                Q1 2012
## 6              23500 2007-04-26 00:00:00                Q2 2007
##                 MemberKey MonthlyLoanPayment LP_CustomerPayments
## 1 924335448287456229847ED             173.71              521.13
## 2 893E3365582060699EBBADA             334.54             5325.33
## 3 93173576199169167625E6A              78.67              236.01
## 4 5CCA33798155110380A1C68              62.86              817.18
## 5 42923531265816888629D0D              80.07              890.63
## 6 E36E33838034687974BB585             909.68            23511.66
##   LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 1                       209.75             311.38          -9.81
## 2                      3987.33            1338.00         -54.61
## 3                       117.73             118.28          -4.89
## 4                       403.08             414.10          -7.14
## 5                       430.97             459.66         -17.43
## 6                     14754.94            8756.72        -324.49
##   LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 1              0.00               3790.25             3790.25
## 2              0.00               6012.65             6012.67
## 3              0.00               1882.27             1882.27
## 4              0.00               1096.92             1096.92
## 5              0.00               1569.03             1569.03
## 6           -523.49               8787.08             8745.06
##   LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 1                            0.00             1               0
## 2                          268.96             1               0
## 3                            0.00             1               0
## 4                            0.00             1               0
## 5                            0.00             1               0
## 6                            0.00             1               0
##   InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 1                          0                           0        10
## 2                          0                           0        85
## 3                          0                           0        30
## 4                          0                           0        15
## 5                          0                           0         2
## 6                          0                           0       384

The new variable above looks great. Now lets plot CreditScoreRangeUpper, StateMonthlyIncome, and LoanStatus.

It’s a clean plot but it doesn’t tell us much. Let’s use the dplyr package to group our new variable by ‘LoanStatus’, ‘StatedMonthlyIncome’, ‘TotalTrades’, ‘CreditScoreRangeUpper’ and we’ll calculate the mean and median for ‘BorrowerAPR’ into a new variable. Lets print out a few rows to make sure all is well.

## # A tibble: 6 x 7
## # Groups:   LoanStatus, StatedMonthlyIncome, TotalTrades [1]
##   LoanStatus StatedMonthlyIncome TotalTrades CreditScoreRangeUpper
##       <fctr>               <dbl>       <dbl>                 <int>
## 1 Chargedoff                   0           1                   539
## 2 Chargedoff                   0           1                   619
## 3 Chargedoff                   0           1                   679
## 4 Chargedoff                   0           1                   719
## 5 Chargedoff                   0           1                   739
## 6 Chargedoff                   0           1                   759
## # ... with 3 more variables: BorrowerAPR.mean <dbl>,
## #   BorrowerAPR.median <dbl>, n <int>

Lets lot the new variable and see if we can build on our assumption from earlier that borrowers with less than 60 trade lines are more likely to have defaults and charge offs. We’ll run both log10 and coord_trans to see the visual difference in our plots.

Our coord_trans plot.

Here is the same three variables but with an added elipsed layer.

Now lets plot our new variable but change the our x-axis to ‘CreditScoreRangeUpper’.

The plot above seems over plotted, lets revise is below.

Much better. We see that borrowers above 60 or so trade lines have fewer delinquencies or charge offs.

BorrowerAPR, CreditGrade, LoanStatu

Lets take a slight detour here and look at box plots. Here we’ll plot Credit Grade, BorrowerAPR, and LoanStatus.

dat4 <- df %>% 
  tibble::rownames_to_column(var="outlier") %>% 
  group_by(CreditGrade) %>% 
  mutate(is_outlier=ifelse(is_outlier(BorrowerAPR), BorrowerAPR, as.numeric(NA)))

dat2$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)

ggplot(aes(CreditGrade, BorrowerAPR), data = subset(dat4, LoanStatus == 'Defaulted' | LoanStatus == 'Chargedoff')) +
  geom_boxplot(color = '#00cc00') +
  geom_text(aes(label=is_outlier, colour = LoanStatus),na.rm=TRUE,nudge_y=0.05, 
            check_overlap = TRUE, size = 3) +
  scale_y_continuous(limits = c(0, .55), breaks = seq(0, .55, .05)) 

The plot is a little crowded. Lets see if we can revise it.

ggplot(aes(ProsperRating..Alpha., BorrowerAPR, fill=LoanStatus), data = 
         subset(df, LoanStatus == 'Defaulted' | LoanStatus == 'Chargedoff')) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, .55), breaks = seq(0, .55, .05))

That’s nice. We see where our outliers are within each credit grade tier.

Talk about some of the relationships you observed in this part of the  

investigation. Were there features that strengthened each other in terms

of  looking at your feature(s) of interest?

I created a new variable using the filter() method to clear up the clutter from the original data frame and in particular the ‘LoanStatus’ variable. This variable had 12 different categories so it was a good idea. Our new filtered variable was still too cumbersome so I took advantage of the group_by() method in dplyr to only look at LoanStatus, StatedMonthlyIncome, TotalTrades, CreditScoreRangeUpper and summarized the mean and median for BorrowerAPR. The importance of this process was it allowed me to isolate the features explicitly to look and allowed a more granular look at the remainder of my analysis.

Were there any interesting or surprising interactions between features?

Yes there are a number of surprising interactions. We see that at an alpha of 1/10 there are distinct vertical lines in the BorrwerAPR.mean axis. In other words there are noticeable amounts of defaulted and charged off loans at approximately .30, .36 and .37 BorrowerAPR.mean and interestingly enough these lines fade once borrowers reach 60 trade lines.

Final Plot and Summary

Plot One

Description

The reason why I choose the plot above is because it shows two interesting finds.
First, its shows that borrowers with 60 or greater trade lines are less likely to go into default or charge off status. Second, we notice three defined vertical lines that suggest borrowers paying above 30% APR are more likely to go into default and charge off status.

Plot two

ggplot(subset(df,LoanStatus == 'Defaulted' | LoanStatus == 'Chargedoff'), 
       aes(factor(CreditScoreRangeUpper), TotalTrades, fill = LoanStatus)) +
  geom_boxplot() +
          theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ylim(0,60) +
    labs(color = 'Loan Status', y = 'Number of Historic Trade Lines Open', 
       x = 'Upper Credit Score Range', 
       title = 'Loan Status in Relation to Loan Status and Trade Lines')
## Warning: Removed 3156 rows containing non-finite values (stat_boxplot).

Description

We see our median trade lines are about 30 and outliers in the 60 range.
We also notice defaults and charge offs are common among all credit profiles.
What is interested and to our point is that there are not a whole lot of outliers above 60 therefore borrowers with 60 plus trade lines would seem to be less riskier.

Plot three

Description

Whats interested about the plot above is that we see a spike of number of loans originated where the lender yield is slightly above 30%. As we saw in plot one These are the borrowers who visually have more defaults and charge offs.

Refelction

What were some of the struggles that you went through?

Plotting on the correct axes was a challenge at first especially when categorical and continuous variables were being plotted. I also did not enjoy running summaries and different numerical measures such as range(), then I created the sumFun() function and this helped repeating the same lines of code over and over.

What went well?

Describing the plots went well. Labeling the axes, title, and legends was fun and rewarding knowing the plot was finally complete.

What was surprising?

What was surprising borrowers who have greater than 60 trade lines have fewer defaults and charge offs. Its a surprise because credit score was not much of a factor here. People on both ends of the credit spectrum (high and low scores) performed better when they had 60 or more trade lines.

Insight on Future Work?

The data set is rich with variables. You can spend months analyzing different variables and their relation with one another. I’d be interested in diving deeper to sort through each variable in relation to loan status. I’m sure there are more variables other than trade lines that can help us better understand borrower patterns into default and charge offs.